HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
GRADUATION THESIS
Applying big data analysis technology and a Graph
Neural Network model to the call spam problem in
telecommunications networks
NGUYEN TRUNG KIEN
Kien.NT212106M@sis.hust.edu.vn
Major: Data Science
Specialization: Computer Science
Supervisor: Dr. Vu Tuyet Trinh
Signature
Department: Computer Science
School: School of Information and Communications Technology
HANOI, 10/2023
ĐT.QT12.BM12   Edition: 02   Date of issue: 28/04/2023
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
CONFIRMATION OF MASTER'S THESIS REVISIONS
Full name of the thesis author: Nguyen Trung Kien
Thesis title: Applying big data analysis technology and a graph neural network
model to the problem of predicting voice-spam subscribers in telecommunications
networks
Major: Data Science
Student ID: 20212106M
The author, the scientific supervisor, and the Thesis Examination Board confirm
that the author has revised and supplemented the thesis according to the minutes of
the Board meeting of 28/10/2023, with the following contents:
1. Add current related studies to Chapter 2: Related studies have been added; see
References [1]-[19]. The open problems are stated in Section 1.3 (Problem of
Research).
2. Add a flowchart to clarify the processing flow and present it in an MLOps style:
A data flow diagram (Figure 3.3) has been added describing the six main data flows
in the design; the model installation and optimization flow is designed in an MLOps
style for continuous execution.
3. Fix blurry figures: Blurry figures have been removed and replaced with clearer
ones, and the presentation has been adjusted to clarify the frameworks used to
implement the big data flows.
4. Describe the loss function used when training the model: A description of the
loss function used for model training and evaluation has been added (Figure 4.8).
Date ..... Month ..... Year .....
Supervisor    Thesis Author
CHAIRMAN OF THE EXAMINATION BOARD
ACKNOWLEDGMENT
Firstly, I would like to express my gratitude to Professor Vu Tuyet Trinh for her in-
valuable guidance, support, and mentorship throughout the course of this research.
Upon receiving the research direction to address the issue of spam calls, I was faced
with a daunting challenge. This is a significant problem that plagues society, and I
was apprehensive due to my limited experience in academic research.
However, with Professor Trinh's meticulous guidance and specific orientation
at each stage over the course of two years, my ideas and implementation methods
gradually became clearer. After several unsuccessful submission attempts, the
paper "A novel method for spam call detection using graph convolutional
networks" was finally accepted for presentation at the 16th ACIIDS
(Asian Conference on Intelligent Information and Database Systems) in 2023.
This marked a significant milestone in the research's initial success.
In addition, during the process of completing the project, Professor Tran Viet
Trung provided invaluable assistance in building and optimizing the big data pro-
cessing flow architecture and data analysis. With his guidance, I was able to com-
plete my data set as input for the prediction model.
At SOICT, the professors are always enthusiastic, serious, and professional in
guiding graduate students on their academic path. This has been a crucial stepping
stone for me toward further research.
Furthermore, I would like to extend my sincere appreciation to Viettel Group for
providing me with the opportunity to apply my research specifically to the problem
of user protection. Witnessing my research come to life in this manner has been an
incredibly meaningful experience for me as a graduate student.
ABSTRACT
Malicious calls seriously trouble our daily lives. Mobile users are not only
annoyed by advertising messages but also scammed: millions of people are
inundated every day by a growing volume of unwanted advertising and phishing calls.
Several software programs have been developed to protect mobile phone users
from spam calls, and numerous machine learning models have subsequently been
developed to forecast subscribers who disseminate spam calls. However, these
classification models consistently struggle with the ambiguity between spam
calls and regular ones. Additionally, previous machine learning-based research
does not emphasize an important feature: the relationships between
mobile users.
In addition, various independent entities have created applications to tackle the
problem of spam calls, such as Truecaller. Nonetheless, a comprehensive resolu-
tion of this issue necessitates the participation of network operators.
In this research, I present a framework that uses the network relationships between
mobile users for spam detection. The framework makes three main contributions:
i) it constructs a big data processing architecture over telecommunications data
that collects call signals to synthesize data features and determine spam or normal
calls; ii) it constructs a telephony graph dataset for spam-call problems, which
makes it possible to exploit the relationships among users in more detail; and
iii) it proposes a graph neural network (GNN)-based model for the spam detection
task. The experiments show that our model outperforms strong baseline models in
this research field.
Student
(Signature and full name)
TABLE OF CONTENTS
CHAPTER 1. INTRODUCTION ............................................. 1
1.1 Problem Statement ............................................... 1
1.2 Background Research ............................................. 2
1.3 Problem of Research ............................................. 3
1.4 Research Objectives ............................................. 4
1.5 Conceptual Framework ............................................ 4
1.6 Contributions ................................................... 4
1.7 Organization of Thesis .......................................... 5
CHAPTER 2. LITERATURE REVIEW ........................................ 6
2.1 Related work .................................................... 6
2.2 Big data framework .............................................. 10
2.2.1 Apache Hadoop ................................................. 10
2.2.2 Apache Spark .................................................. 12
2.2.3 Apache Kafka .................................................. 15
2.3 Graph-based model ............................................... 16
2.3.1 Graph Basics .................................................. 16
2.3.2 Real-life Graph Data and Applications ......................... 17
2.3.3 Graph-based Embedding ......................................... 18
2.3.4 Graph Neural Network .......................................... 21
CHAPTER 3. PROPOSED SOLUTION ........................................ 25
3.1 Overview ........................................................ 25
3.2 Mobile Network Data Aggregation ................................. 25
3.2.1 System Architecture ........................................... 25
3.2.2 Collective spam labelling by end-users ........................ 26
3.2.3 Data Flow Diagram ............................................. 28
3.3 Graph Structure-based Telephone Data Representation ............. 29
3.4 Spam call detection using GNN ................................... 30
3.4.1 Graph neural network for spam call detection .................. 30
3.4.2 Proposed Model ................................................ 31
CHAPTER 4. EXPERIMENTS AND RESULTS .................................. 33
4.1 Big data Framework Setting ...................................... 33
4.2 Feature Engineering ............................................. 34
4.3 Dataset ......................................................... 35
4.4 Model Environment Settings ...................................... 35
4.4.1 Machine learning framework installation ....................... 35
4.4.2 Base Model Hyperparameters .................................... 37
4.4.3 GNN Hyperparameter Setting .................................... 37
4.5 Evaluation standard ............................................. 37
4.6 Main results .................................................... 39
CHAPTER 5. CONCLUSIONS .............................................. 45
5.1 Summary ......................................................... 45
5.2 Suggestions for Future Work ..................................... 45
REFERENCES .......................................................... 49
LIST OF FIGURES
Figure 1.1  One type of adapter that enables spammers to change their
            mobile numbers dynamically ............................. 2
Figure 2.1  The impact of call length on the classification [5] ..... 8
Figure 2.2  Sensitivity and specificity [5] ......................... 9
Figure 2.3  Apache Hadoop [20] ...................................... 11
Figure 2.4  Apache Spark Ecosystem [21] ............................. 12
Figure 2.5  Apache Kafka [22] ....................................... 15
Figure 2.6  Basic graph presentation ................................ 16
Figure 2.7  Real-life Graph Data [23] ............................... 18
Figure 2.8  Graph Data Application [23] ............................. 19
Figure 2.9  Node Embedding in GCN ................................... 22
Figure 3.1  System Architecture ..................................... 26
Figure 3.2  Survey flash message .................................... 27
Figure 3.3  Data Flow Diagram ....................................... 28
Figure 3.4  A telephony network graph implemented by the Random
            Walk algorithm .......................................... 30
Figure 3.5  Node Embedding in GNN with Edge Features ................ 31
Figure 3.6  The proposed GNN Architecture ........................... 32
Figure 4.1  System Architecture Implementation ...................... 33
Figure 4.2  Big data cluster on the Apache Ambari dashboard ......... 34
Figure 4.3  The first optimization result ........................... 40
Figure 4.4  The second optimization result .......................... 40
Figure 4.5  The third optimization result ........................... 41
Figure 4.6  The fourth optimization result .......................... 41
Figure 4.7  Feature Importance ...................................... 42
Figure 4.8  Training loss and validation loss ....................... 43
Figure 4.9  Precision and recall calculation [30] ................... 43
Figure 4.10 GNN Confusion Matrix .................................... 44
Figure 4.11 Base Model Confusion Matrix ............................. 44
LIST OF TABLES
Table 4.1  Lab server resources ..................................... 34
Table 4.2  Batch parameter optimization ............................. 35
Table 4.3  Feature List ............................................. 36
Table 4.4  Base Model Hyperparameters ............................... 38
Table 4.5  The hyperparameters of the proposed GCN model ............ 38
Table 4.6  Model results ............................................ 39
Table 5.1  Features of the CDR log .................................. 51
LIST OF ABBREVIATIONS
Abbreviation Definition
ACIIDS Asian Conference on Intelligent
Information and Database Systems
CDR Call Detail Record
GCN Graph Convolutional Neural Network
GMSC Gateway Mobile Switching Center
GNN Graph Neural Network
CHAPTER 1. INTRODUCTION
1.1 Problem Statement
Malicious calls seriously trouble our daily lives. A spam call is a broad term
that encompasses any type of unwanted call received on your phone line. This
category includes robocalls, solicitor calls, fake calls, spoofed calls, and more.
Mobile users are not only annoyed by advertising messages but also scammed.
Millions of people are inundated every day by unwanted advertising and phishing
calls. In Vietnam, according to a survey conducted by Viettel Group,
approximately 14,179,748 Vietnamese mobile users reported being disturbed by
39,597,469 spam calls in February 2020.
There are several factors contributing to the rise in spam calls. In recent times,
advancements in artificial intelligence have made it easier for spammers to employ
automated calling systems capable of generating thousands of calls within min-
utes. Additionally, certain types of adapters enable spammers to change their mo-
bile numbers dynamically and modify their behavior to evade conventional spam
detection mechanisms. Figure 1.1 illustrates the concurrent usage of multiple
mobile numbers.
Several software programs have been developed to prevent mobile phone users
from being disturbed by spam calls. These programs function by blocking calls
made from phone numbers that are either on a blocklist or absent from the
device's contact list.
Subsequently, numerous machine learning models were developed to forecast
subscribers who disseminate spam calls. However, classification models
consistently struggle with the ambiguity between spam calls and regular ones.
In some cases, a phishing call behaves exactly like a normal one. On the other
hand, with the growth of internet-based services such as food delivery, logistics,
and customer care, legitimate anonymous calls have increased dramatically.
Various independent entities have created applications to tackle the problem of
spam calls, such as Truecaller. Nonetheless, a comprehensive resolution of this
issue necessitates the participation of network operators. Consequently, it is
imperative to devise a new approach on the telephone-operator side that can
automatically identify unwanted calls.
Figure 1.1: One type of adapter that enables spammers to change their mobile
numbers dynamically
1.2 Background Research
Previous works approached the problem by applying rules such as black/white
lists, enforced caller introduction, and call-rate limiting [1], [2]. MacIntosh
et al. [3] analyzed VoIP signaling messages to assist service providers in
detecting spam activity targeting their customers.
Lentzen et al. [4] proposed content-based detection that uses a database of
feature vectors, comparing new calls with previous ones. Elizalde et al. [5]
analyzed voice features to filter spam voicemails: a robust audio fingerprint
of spectral feature vectors is computed for incoming audio data.
On the other hand, Chaisamran et al. [6] proposed a trust-based mechanism that
uses call duration and call direction between two users. The trust value is
adjusted according to calling behavior. Furthermore, a trust inference mechanism
is also proposed to calculate a trust value for an unknown caller to a listener.
Recent works focus on applying Machine learning (ML) models for spam user
classification. Specifically, Naive Bayes, K-Nearest Neighbors, SVM, Logistic
Regression, and Random Forest are well-known models used to detect email spam
[7]. The work in [8] focused on using a Recurrent Neural Network (RNN) to detect
malicious calls; reviewing different state-of-the-art machine-learning methods
on the proposed features, it found that the best approach can block up to 90% of
malicious calls while keeping the benign-call accuracy above 0.9, and that the
models can be implemented effectively without significant latency overhead.
Moreover, Li et al. [9] applied Random Forest, XGBoost, RNN, and SVM with 29
features for mobile applications. These models are based on the behavior features
of mobile numbers and are effective at filtering obvious cases such as robocalls.
However, they still struggle to classify the ambiguous cases mentioned above.
1.3 Problem of Research
Consequently, with a server-side, machine learning-based approach, it is
imperative to establish a methodology for collecting and analyzing user-behavior
features for the machine learning model. Gathering feature data and labeled spam
users necessitates a big data processing solution, and the data must be
continuously updated so that the prediction model does not become outdated.
Furthermore, no existing data set is structured in graph form to represent
complicated user behaviors such as brand-name advertisement calls, food delivery,
logistics services, and sales calls. User behaviors are intricate: with the
advent of internet-based services, anonymous calls have proliferated, and
salesmen's mobile numbers are used for both marketing and personal purposes.
Every day, thousands of new subscribers connect to the mobile network to send
spam without any historical data.
Additionally, previous machine learning-based research does not emphasize an
important feature: the relationships between mobile users. The telephony network
is naturally an extensive graph structure connecting subscribers, and call
behavior shows multiple scenarios. Normal mobile users engage with various
groups such as family, friends, company, and customer service, and a hotline
number associated with a company is identified by its brand name. However,
abnormal subscribers, who remain anonymous, always exhibit irregular
relationships. Therefore, spam classification is well-suited to a graph neural
network model, which combines call-behavior features with telecommunication relationships to
yield better results in spam detection.
1.4 Research Objectives
To address the aforementioned issues, this research concentrates on constructing
an exhaustive data set on voice-spam subscribers' conduct from the standpoint of
telecommunications service providers. Subsequently, a novel prediction model, a
graph neural network, is examined, leveraging the associations and behavioral
attributes among telecommunications subscribers to improve efficiency compared
to prior research.
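The aggregation step at the heart of such a graph neural network can be illustrated with a minimal sketch. The layer below implements the standard GCN propagation rule H' = ReLU(D^-1/2 (A + I) D^-1/2 H W) in plain NumPy; the toy adjacency matrix, features, and weights are illustrative assumptions, not the data or implementation used in this thesis.

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One graph-convolution layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    a_hat = adj + np.eye(adj.shape[0])            # add self-loops
    deg = a_hat.sum(axis=1)                       # degrees of A + I
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))      # D^-1/2
    return np.maximum(d_inv_sqrt @ a_hat @ d_inv_sqrt @ features @ weight, 0.0)

# Toy telephony graph: 3 subscribers; an edge means they have called each other.
adj = np.array([[0., 1., 1.],
                [1., 0., 0.],
                [1., 0., 0.]])
# Hypothetical normalized node features, e.g. [call activity, spam reports].
feats = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.5, 0.5]])
weight = np.full((2, 2), 0.5)                     # untrained example weights
h1 = gcn_layer(adj, feats, weight)
print(h1.shape)                                   # (3, 2): new embedding per node
```

Each node's new embedding mixes its own features with those of the subscribers it exchanges calls with, which is exactly the relational signal that behavior-only models miss.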
1.5 Conceptual Framework
To achieve the above goal, the study proposes the following three solutions:
Firstly, a comprehensive big data processing system was developed. Telecom-
munications logs were utilized to acquire data, and characteristics were synthesized
through statistical probability calculations. Features with a significant impact on
the model were selected. Furthermore, to gather labels for the data set, a real-time
processing flow was established to survey calls, in which mobile users classify
calls as either spam or non-spam.
Secondly, I propose a data set that applies a graph structure to exploit the
relationships between mobile users for detecting spam users. This data set serves
as the input to the prediction model. Representing the problem as a graph models
the telephony relationships effectively: regular users interact within stable
groups such as family, friends, companies, and customer services, whereas
abnormal users always have irregular relationships, so most of their calls go to
unacquainted numbers.
Lastly, the study focuses on building the architecture of, and testing, a Graph
Convolutional Neural Network (GCN) model [10] for predicting voice-spam
subscribers from our telephony network data, which is designed in a graph
structure. The performance of this model is compared with the baseline models
used in previous studies.
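The first step of this plan, turning raw call records into a graph, can be sketched in a few lines. The CDR tuples and field names below are made up for illustration; the actual pipeline aggregates telecom logs with the big data stack described in Chapter 3.

```python
from collections import defaultdict

# Hypothetical CDR rows: (caller, callee, duration in seconds).
cdr_rows = [
    ("A", "B", 120), ("A", "B", 30), ("A", "C", 95),
    ("X", "B", 4), ("X", "C", 3), ("X", "D", 6),   # short fan-out: spam-like
]

def build_call_graph(rows):
    """Aggregate CDR rows into directed edges with per-edge features."""
    edges = defaultdict(lambda: {"calls": 0, "total_duration": 0})
    for caller, callee, duration in rows:
        edge = edges[(caller, callee)]
        edge["calls"] += 1
        edge["total_duration"] += duration
    return dict(edges)

graph = build_call_graph(cdr_rows)
print(graph[("A", "B")])   # {'calls': 2, 'total_duration': 150}
```

The resulting edge features (call count, total duration) are the kind of relational attributes that the graph data set exposes to the prediction model, alongside per-node behavior features.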
1.6 Contributions
The study’s main contribution is threefold:
Firstly, constructing a big data processing architecture with telecommunications
data that collects call signals to synthesize data features and determine spam or
normal calls.
Secondly, constructing a telephony graph data set for spam call problems, which
makes it possible to exploit the relationships among users in more detail. To the
best of our knowledge, this is the first study that represents telephony data as a
graph structure for further exploitation.
Thirdly, presenting a classification model based on a graph neural network to
improve the performance of spam-user detection.
1.7 Organization of Thesis
In order to clearly and accurately demonstrate the proposed model, this project
is divided into five chapters. They are as follows:
Chapter 1: This chapter provides an overview of the problem, highlighting the
significance of the anti-spam call problem for telecommunications carriers. It also
presents an overview of related research and proposed solutions to address any
remaining problems.
Chapter 2: This chapter focuses on the related work for spam call detection, the
theoretical foundations of graph neural networks and big data analysis platforms.
Chapter 3: This chapter presents the proposed ideas, the system architecture,
and the proposed model.
Chapter 4: This chapter is dedicated to testing and evaluating the quality of the
proposed model. It also includes a comparison with baseline models. Furthermore,
it provides a detailed description of the data set used and the tools employed.
Chapter 5: The final chapter concludes the project by discussing its limitations,
along with suggestions for future directions.
CHAPTER 2. LITERATURE REVIEW
2.1 Related work
In [1], the authors analyze the issue of spam in SIP. They begin by identifying
the similarities and differences between the problem of spam in SIP and email.
Subsequently, they explore the solutions that have been proposed for email and
evaluate their potential applicability to SIP, considering options such as
Content Filtering, Black Lists, White Lists, Reputation Systems, Limited-Use
Addresses, Turing Tests, Payments at Risk, and Circles of Trust.
J. Peterson and C. Jennings, in [2], provide enhancements to the existing
mechanisms for authenticated identity management in the Session Initiation
Protocol (SIP) [11]. An identity, for the purposes of that document, is defined
as a SIP URI, commonly a canonical address-of-record (AoR) employed to reach a
user (such as 'sip:alice@atlanta.example.com'). A cryptographic approach, like
the one described there, can provide a much stronger and less spoofable
assurance of identity than the telephone network provides today.
In a VoIP service network, attack traffic has abnormal characteristics depending
on the kind of attack. A Mass Call Spam (MCS) attack, in which spammers call a
large number of users via an Internet telephony service, also has an abnormal
traffic pattern. The paper [12] proposes an analysis scheme for detecting MCS in
Session Initiation Protocol (SIP)-based Internet telephony services, using
efficient statistical analysis modules to detect the abnormal traffic patterns
caused by MCS.
In [3], the authors propose an innovative method to detect and block spam in
IP telephony networks. The rapid adoption of voice over IP (VoIP) technology has
introduced new, powerful options for spammers and telemarketers to increase their
productivity and effectiveness. While a few concepts have already been proposed in
the VoIP spam prevention area, these prior solutions have mainly focused on iden-
tity control and reliable authentication, following in the footsteps of email spam so-
lutions. Such measures imply at least strong collaboration of service providers and
universal standardization. In this paper, the authors describe a new method based
on the analysis of VoIP signaling messages, which can assist service providers in
detecting spam activity targeting their customers. This “locally centric” approach
would enable a service provider to handle the call before the actual voice spam
content reaches the recipient. The detection parameters depend solely on the local
service provider's policy; no end-user participation or compliance is required.
In [13], a user-behavior-aware anti-SPIT technique implemented at the router
level for detecting and filtering SPIT is proposed. The rationale for the
technique is that voice spammers behave significantly differently from
legitimate callers because of their revenue-driven motivations. The technique
defines and combines three features derived from user-behavior analyses to
detect and filter spam calls. Compared to existing SPIT-defending techniques,
it is simple, fast, and effective. It also applies to both machine-initiated and
human-initiated spam calls, and better protects VoIP calls against Sybil attacks
and changes in spammer behavior.
The work [14] aims to provide a simplified framework based on caller-behavior
patterns to create an anti-spam approach. The strategy assumes that fraudsters
with a profit motive act differently from genuine callers and have a distinctive
calling pattern. Such patterns can be generalized and combined with simple
mathematical techniques to help filter spam calls. The suggested approach is
appropriate for identifying spam calls in many contexts and is more effective
than current spam-call defense strategies.
A new SPIT detection method using voice-activity analysis is proposed in [15].
After analyzing the difference in call behavior between spam and normal calls,
the voice saturation ratios of the calling and called parties and the callee
conflict ratio are used as call-behavioral parameters, obtained by analyzing the
voice-activity status of both parties. Finally, a spam classifier is designed
using a support vector machine, and the experimental results show that the
method can effectively recognize spam calls.
The method Lentzen et al. [4] proposed for detecting spam calls is based on
computing a robust audio fingerprint of spectral feature vectors for incoming
audio data. The system then compares new calls with previous ones and detects
replays with identical or similar audio data. Depending on the policy, future
calls from the same source can be blocked during call setup. A prototype based
on this approach has been developed, and first results show that the system can
effectively detect and block spam calls.
Telecommunication regulators and carriers have implemented automated systems to
identify unwanted calls using Call Detail Records, which include call-origin and
call-duration information. However, the actual audio content is often
overlooked. In [5], researchers proposed an audio-based spam-call detection
method that uses acoustic features of recorded voicemails to identify human
calls versus robocalls, and to distinguish spam from non-spam calls for human
callers. The results showed that voiced and unvoiced audio content carries
sufficient discriminatory information for both distinctions. The method achieved
0.93 accuracy in distinguishing human calls from robocalls, compared to
0.75-0.83 accuracy in distinguishing spam from non-spam calls. The researchers
expect that their automated approach can serve as an auxiliary tool, in
combination with other call-behavior statistics, to reduce the frequency of
unwanted calls and fraudulent incidents.
Figure 2.1: The impact of call length on the classification [5]
Figure 2.1 shows the impact of call length on the classification of spam versus
non-spam calls, reporting unweighted accuracy, the true positive rate (TPR) for
spam, and the true negative rate (TNR) for non-spam over 100K SVM runs. The
researchers also found that spam calls were twice or more as likely to come from
a (perceived) female voice. In addition, they examined annotation speed and
found that, on average, it took a reviewer 6.6 seconds to decide whether a call
was spam, with non-spam calls taking slightly longer to label (7.7 seconds) than
spam calls (6.2 seconds).
On the other hand, Chaisamran et al. [6] proposed a trust-based mechanism that
uses call duration and call direction between users to distinguish legitimate
callers from spammers. The trust value is adjusted according to calling
behavior, and a trust inference mechanism is proposed to calculate a trust value
for an unknown caller to a callee. As shown in Figure 2.2, even when the number
of spammers increased, the detection accuracy for spam and legitimate calls
remained above 0.98 and 0.95, respectively. Moreover, the paper [16] proposes a
SPIT detection algorithm based on users' call behavior. Simulation results show
the efficiency of the detection method and outline its most significant
parameters; in particular, inspecting call duration allows SPIT messages to be
discovered quickly and precisely. In [17], the researchers propose a model based
on dynamic sliding windows to detect opt-in phone calls from mobile call detail
records, which is useful for detecting unwanted (e.g., spam) and commercial
calls. For validation, they used actual call logs of 100 users collected at MIT
by the Reality Mining Project over a period of 8 months. The experimental
results show that the model achieves good performance with 0.91 accuracy.
Figure 2.2: Sensitivity and specificity [5]
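The trust-adjustment idea in [6] can be sketched as a per-caller score that rises with long answered calls and falls with short or unanswered ones. The exponential update rule, the rate alpha, and the 10-second threshold below are illustrative assumptions, not the paper's exact formulation.

```python
def update_trust(trust, duration_s, answered, alpha=0.1, short_call_s=10):
    """Adjust a caller's trust after each call; the result stays in [0, 1].

    Long answered calls move trust toward 1; very short or unanswered
    calls (typical of spam blasts) move it toward 0.
    """
    if answered and duration_s >= short_call_s:
        trust += alpha * (1.0 - trust)
    else:
        trust -= alpha * trust
    return min(max(trust, 0.0), 1.0)

trust = 0.5                                   # neutral prior for an unknown caller
for duration, answered in [(120, True), (3, True), (0, False)]:
    trust = update_trust(trust, duration, answered)
print(f"{trust:.4f}")                         # 0.4455: two bad calls outweigh one good one
```

A threshold on this score would then separate likely spammers from legitimate callers, with the score continuing to adapt as calling behavior changes.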
Recent research focuses on applying machine learning (ML) models to spam-user
classification. The paper [18] proposes a model that uses the time duration
(length) of calls between two users to differentiate normal callers from spam
callers. A process for spotting spammers in VoIP is defined by classifying
differences in the call graph: a weighted graph is created from call data
records, where a set of discriminating call parameters defines the weights on
the edges. In this VoIP spam detection, an approach merging protocols and caller
characteristics is presented, providing a straightforward way to use call
duration as an automatically selected feature based on human behavior. In [9],
researchers developed a TouchPal user interface on top of a mobile app to let
users tag malicious calls, which helps maintain a large-scale call-log database.
They conducted a measurement study over three months of call logs, comprising
9 billion records, and designed 29 features based on the results, which machine
learning algorithms can use to predict malicious calls. The researchers
extensively evaluated different state-of-the-art machine learning approaches
using the proposed features; the best approach can block up to 90% of malicious
calls while maintaining good precision on benign call traffic. The results also
show that the models are efficient to implement without incurring significant
latency overhead. An ablation analysis revealed that using 10 of the 29 features
reaches performance comparable to using all of them.
The work in [8] focused on using a Recurrent Neural Network (RNN) to detect
malicious calls. With the proposed features, the authors review different state-of-the-art
machine learning methods and infer that the most optimal approach can reduce
malicious calls by up to 90% while keeping over 0.9 binary call accuracy.
The outcomes also show that the models can be implemented effectively without
significant latency overhead, as confirmed by an evaluation analysis.
Recently, the paper [19] presented an overview of AI-based fraud and spam de-
tection and analysis techniques, along with their challenges and potential solutions.
A novel fraud-call detection approach is proposed that achieved high accuracy
and precision. The proposed approach was evaluated on a dataset of real-world
fraudulent calls, and the results demonstrate that it achieved high accuracy
in detecting malicious calls and identifying potential indicators of fraud or spam.
The analysis of fraud calls also provided insights into the tactics and methods em-
ployed by fraudsters, which can be used to develop countermeasures.
2.2 Big data framework
2.2.1 Apache Hadoop
Hadoop is an open-source framework for distributed storage and processing of
extremely large data sets on commodity hardware. The core of Hadoop consists
of the Hadoop Distributed File System (HDFS), MapReduce, and YARN. HDFS is
the primary distributed storage used by Hadoop applications. MapReduce is the
programming model used for large-scale data processing. YARN is the resource
management layer that schedules jobs across the cluster.
Figure 2.3: Apache Hadoop [20]
As shown in Figure 2.3, the Hadoop ecosystem refers to the various components that in-
tegrate with or extend the core Hadoop to provide additional capabilities. These
include tools for data ingestion, data processing, data analytics, machine learning,
data visualization and more. The entire ecosystem is designed to tackle complex
Big Data problems.
HDFS is the foundation of the Hadoop ecosystem. It is a distributed file system
designed to reliably store very large files across machines in a Hadoop cluster. Files
are stored in redundant fashion across multiple nodes to enable fault tolerance.
HDFS follows a master-slave architecture with a namenode acting as the master
that manages the file system metadata and datanodes that store the actual data as
blocks.
The namenode maintains the filesystem tree and the mappings of file blocks to
datanodes. Client applications contact the namenode for file metadata and then
read/write file data directly from/to the datanodes. Datanodes also perform block
creation, deletion and replication on instruction from the namenode. HDFS is op-
timized for large sequential reads and writes of data, which is typical of many big
data use cases.
HDFS offers several benefits that make it suitable for distributed storage and
processing:
Scalability: HDFS clusters can scale horizontally simply by adding more com-
modity servers. This provides flexibility to handle increasing data volumes.
High Availability: Data is replicated across multiple datanodes to provide re-
dundancy and fault tolerance in case of node failures. The secondary namenode
helps keep the primary namenode state synchronized.
Reliability: Checksumming of data blocks and automatic re-replication of under-
replicated data ensures detection and quick recovery from corruption.
Cost effectiveness: HDFS runs on low-cost commodity hardware reducing stor-
age costs while providing high aggregate bandwidth across the cluster.
Simplified data processing: Data locality optimization allows scheduling com-
putation near the data and minimizes network traffic.
The HDFS architecture, replication method, fault detection and recovery makes
it suitable for distributed storage and analysis of huge datasets typically found in big
data use cases. The integration of HDFS with MapReduce and YARN frameworks
provides a comprehensive platform for batch and real-time big data processing.
2.2.2 Apache Spark
Apache Spark is an open source distributed general-purpose cluster computing
framework ideal for large-scale data processing. Spark provides an interface for
programming entire clusters with implicit data parallelism and fault tolerance.
Figure 2.4: Apache Spark Ecosystem [21]
As shown in Figure 2.4, Spark has several key components and libraries:
The central component of Spark is Spark Core, which provides the most basic
functions of Spark, such as task scheduling, memory management, fault recovery,
and interaction with storage systems. In particular, Spark Core provides
an API for defining RDDs (Resilient Distributed Datasets), collections of
items distributed across the nodes of the cluster that can be processed in parallel.
Spark SQL allows querying structured data through SQL statements. Spark
SQL can operate on various data sources such as Hive tables, Parquet, and
JSON.
Spark Streaming provides an API for easily processing streaming data.
MLlib provides many machine learning algorithms such as classification, re-
gression, clustering, collaborative filtering...
GraphX is the newest component in Spark. Its core abstraction is a directed
multigraph, which contains both edges and vertices and can be used to represent
a wide range of data structures, with properties attached to each vertex and
edge.
The key capabilities of Spark include:
Speed: Spark extends the popular MapReduce model to efficiently support
more types of computations, including interactive queries and stream pro-
cessing. It can run programs up to 100x faster than Hadoop MapReduce in
memory, or 10x faster on disk.
Ease of Use: Spark offers over 80 high-level operators that make it easy to
build parallel apps. It also supports Java, Scala, Python and R APIs for devel-
oping applications. Spark shell provides an interactive environment.
Generality: Spark provides a unified engine that supports a wide range of
workloads including batch applications, iterative algorithms, interactive queries
and streaming. This eliminates the need to use different frameworks for differ-
ent workloads.
Compatibility: Spark can access diverse data sources including HDFS, Cas-
sandra, HBase and S3. It combines with all major Hadoop ecosystem tools.
Spark integrates closely with data processing platforms like Mesos, YARN
and Kafka.
The core abstraction Spark provides is a resilient distributed dataset (RDD)
which is a fault tolerant collection of elements that can be operated in parallel.
Spark can create RDDs from various input sources such as files in HDFS or by
transforming other RDDs.
RDDs support two types of operations - transformations that produce new datasets
from existing ones and actions that return results to the driver program or write
data to external storage. Spark tracks the lineage of RDDs and can recover lost data
through lineage reconstruction in case of failures.
Spark runs as independent sets of processes coordinated by the SparkContext
object in the driver program. The driver program manages Spark jobs and divides
the data processing task into stages consisting of tasks that get executed across
worker nodes in a cluster.
Spark executes computations lazily, only when a result needs to be returned to
the driver. This helps optimize the overall data flow and execution plan by scanning
only the data that needs to be scanned. Spark's use of in-memory caching of RDDs
vastly improves performance over disk-based systems.
Spark runs on resources managed by the standalone cluster manager, YARN, or
Mesos. It can be deployed on Amazon EC2, Azure HDInsight and Google Cloud
Dataproc. For maximum performance, Spark should be run in memory, using disks
only for backups. Spark monitors execution and can automatically re-execute
failed tasks.
Spark is designed to be highly scalable and fault tolerant. It achieves high perfor-
mance through controlled partitioning that minimizes network traffic. Lineage track-
ing helps rebuild RDDs, and caching improves the performance of iterative algorithms.
For developers, Spark offers rich APIs in Java, Scala, Python and R. It integrates
with popular frameworks like HDFS, YARN and provides hundreds of operators
for data transformations. Spark SQL makes it easy to abstract data processing in
schemas and queries.
Spark use cases span batch processing, machine learning, graph computation,
real-time analytics, ETL pipelines, data science, ad-hoc querying and more. It pow-
ers several enterprise workloads at large companies today across industries like
financial services, healthcare, retail, media, manufacturing etc.
Spark has rapidly become a widely adopted distributed big data processing en-
gine in the industry owing to its speed, ease of use and unified model. It brings mas-
sive improvements over disk-oriented systems like Hadoop MapReduce by lever-
aging memory and optimized execution. With its strong community and growing
ecosystem, Spark is poised to be a long-term platform for building fast, unified Big
Data applications.
In summary, Apache Spark is a fast and general distributed computing engine for
large-scale data processing. It delivers speed, ease of use and a unified architecture
combining batch, streaming, machine learning and graph workloads. Spark’s in-
memory processing and optimized execution engine make it highly attractive for
interactive, iterative and real-time applications.
2.2.3 Apache Kafka
Kafka is a distributed streaming platform that is scalable and open-source. The
Kafka project was initially developed by LinkedIn and became an Apache open-
source project in 2011. Kafka is written in the Scala and Java programming lan-
guages. Its purpose is to provide a low-latency, high-throughput platform for pro-
cessing real-time data streams.
Figure 2.5: Apache kafka [22]
As shown in Figure 2.5, Kafka is built on a publish/subscribe model. Applications (acting
as producers) send messages (records) to a Kafka node (broker) and indicate that
these messages will be processed by other applications called consumers. The mes-
sages sent to the Kafka node are stored in a location called a topic, and consumers
can then subscribe to that topic and listen to these messages.
A topic can be seen as the name of a category where messages will be stored
and pushed into.
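The topic/producer/consumer relationship described above can be sketched with a toy in-memory model. This is a conceptual illustration only, not the real Kafka client API; all class and method names here are invented for the sketch:

```python
from collections import defaultdict

class MiniBroker:
    """Toy in-memory illustration of Kafka's publish/subscribe model:
    producers append records to a named topic log, and each consumer
    group reads the log in order through its own offset."""
    def __init__(self):
        self.topics = defaultdict(list)      # topic -> append-only log
        self.offsets = defaultdict(int)      # (topic, group) -> next offset

    def produce(self, topic, record):
        self.topics[topic].append(record)

    def consume(self, topic, group):
        log = self.topics[topic]
        off = self.offsets[(topic, group)]
        if off >= len(log):
            return None                      # nothing new for this group
        self.offsets[(topic, group)] += 1
        return log[off]
```

Note how two consumer groups read the same topic independently, each at its own offset, which is the essential decoupling Kafka provides between producers and consumers.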
Zookeeper serves as a distributed key-value data store. It is optimized for fast
reads but slow writes. Kafka uses Zookeeper to perform the election of the leader
of Kafka broker and topic partition. Zookeeper is also designed for high fault tol-
erance, which makes Kafka heavily reliant on Zookeeper.
2.3 Graph-based model
2.3.1 Graph Basics
A graph is a collection of points connected by lines. The points are referred to as
nodes or vertices (plural of vertex), and the connections are referred to as edges or
ties. A basic graph is shown in Figure 2.6.
Figure 2.6: Basic graph presentation
For example, in a friendship social graph, nodes are people, and any pair of connected
people denotes the friendship between them. Depending on the context, these nodes
are called nodes or actors. In a web graph, nodes represent sites and a connection
between nodes indicates a web link between them. In a social setting, these nodes
are called actors.
Edges connect nodes and are also known as ties or relationships. In a social set-
ting, where nodes represent social entities such as people, edges indicate internode
relationships and are therefore known as relationships or (social) ties.
A graph G = (V, E) is defined as follows:
V: the vertex set, consisting of the nodes.
E: the edge set, consisting of the relations between nodes; edges may be directed.
An edge e_ij = (v_i, v_j) ∈ E connects node v_i to node v_j of the graph.
Node features: node behavior and node profile.
Edge features: features between two nodes.
Directed and undirected graphs: a graph is undirected when the edge connecting
two vertices i and j is the same in both directions, i.e. e_ij = e_ji. A graph is
directed when each edge has a direction, so an edge e_ij goes from vertex v_i to
vertex v_j.
There are several graph representations. Firstly, the adjacency matrix A is a square
matrix of size n × n (with n being the total number of nodes in the graph), where
A_ij = 1 if e_ij belongs to E and A_ij = 0 otherwise. The binary adjacency matrix
A can also be viewed as a 0/1 weight matrix. A feature description x_i is given for
every node i and summarized in an N × D feature matrix X (N: number of nodes,
D: number of input features). A representative description of the graph structure
in matrix form is typically the adjacency matrix A.
The degree matrix D is an n × n square diagonal matrix that contains the degree
of each vertex. Note that in a directed graph, the degree of each node only counts
the edges that point toward that node.
The identity matrix I is an n × n diagonal matrix whose values on the main
diagonal are equal to 1 while the rest are equal to 0.
Other graph representations include the edge list and the adjacency list. In an edge
list, each element represents an edge and is usually written as (u, v), denoting
that node u is connected to node v by an edge. In an adjacency list, each node
maintains a list of all the nodes it is connected to. This list is usually sorted by
node order or other preferences.
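The representations above can be built directly from an edge list. The following minimal NumPy sketch (the graph and node count are made up for illustration) constructs the adjacency matrix A, the in-degree matrix D for the directed case, the identity matrix I, and the equivalent adjacency list:

```python
import numpy as np

# A small directed graph given as an edge list; node count n = 3.
edges = [(0, 1), (1, 2), (2, 0), (2, 1)]
n = 3

A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i, j] = 1                      # A_ij = 1 iff e_ij belongs to E

# Directed case: the degree of a node counts only its incoming edges.
D = np.diag(A.sum(axis=0))

I = np.eye(n, dtype=int)             # identity matrix

# Equivalent adjacency list: node -> sorted list of successors.
adj_list = {i: [j for j in range(n) if A[i, j]] for i in range(n)}
```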
2.3.2 Real-life Graph Data and application
In practice, numerous problems require the use of graph data, as shown in Figure 2.7.
For instance:
Analyzing social network data to gain insights into current community trends
and customer groups.
Providing recommendations for making friends and following pages on popu-
lar platforms like Facebook.
Constructing product recommendation systems for e-commerce websites based
on user interaction data.
Figure 2.7: Real-life Graph Data [23]
As shown in Figure 2.8, fundamental problems involving graph data encompass:
Node classification: This problem entails classifying each node on the graph
with corresponding labels. It is also the most popular problem in Graph Neural
Networks (GNNs).
Link Prediction: This problem involves predicting whether two nodes in a
network have a relationship or whether there is a new connecting edge between
them.
Clustering Community Detection: This problem revolves around community
clustering or graph clustering.
2.3.3 Graph-based Embedding
Graph-based Embedding is divided into 2 main subgroups:
Vertex Embedding (Node embedding): mapping a node in the graph to a
different latent space with D dimensions. These latent spaces can be used
for visualization purposes, or applied to downstream tasks such as node
classification, graph clustering, etc.
Graph Embedding: similar to the above, but mapping an entire graph or
sub-graph into a single vector, for example, mapping molecular structures
into a latent space to compare them with each other. This mapping is closely
related to graph/sub-graph classification problems.
Figure 2.8: Graph Data Application [23]
Many graph-based learning algorithms (graph-based embedding, graph neural
networks) are based on the assumption that nodes close to each other will have
similar feature characteristics. For example, who you follow on Twitter partly
helps us guess what issues that user is interested in on social networks, perhaps
academic issues such as machine learning and deep learning, or political, religious,
and ethnic issues inferred from following related users. From there, developers
rely on those relationships to design models for their own purposes, for example
social network analysis or recommendation engines.
Two nodes that are not directly connected to each other can still have similar
characteristics. The most typical example is the collaborative filtering problem in
recommendation systems. When two users A and B are completely unrelated, user
A likes products P1, P2, P3, P4 and user B likes products P1, P3, P4, it is likely
that the buying habits of the two users are quite similar, and the system will
suggest product P2 to user B. Likewise, two nodes such as nodes 5 and 6 may not
connect directly but share many “neighboring” nodes, so it is assumed that these
two nodes will be quite similar in terms of context.
Vertex Embedding (Node embedding) aims to map nodes in the graph to a new
vector space, with neighboring nodes of similar context staying close together.
Several popular techniques for vertex embedding include:
Random Walk: This technique involves randomly traversing neighboring nodes
in the graph, including the possibility of returning to the previous node. By
sampling with random walks, the data is transformed from a complex structure,
a graph with many interconnected nodes and edges, into a 1-D sequence
representation, similar to consecutive words in a sentence.
Node2Vec: Node2Vec is another Node Embedding model based on the ideas
of DeepWalk and Word2Vec. The key difference of Node2Vec is that, in ad-
dition to using random walk as usual, the model introduces two additional
parameters P and Q to adjust the random walk on the graph.
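The sampling step described above can be sketched in a few lines. This is a minimal uniform random walk over an adjacency list (the function name and the example graph are illustrative; Node2Vec would additionally bias the choice of the next node with its P and Q parameters):

```python
import random

def random_walk(adj, start, length):
    """Sample a fixed-length walk over an adjacency list, flattening the
    graph into a 1-D node sequence (analogous to words in a sentence).
    Returning to the previous node is allowed."""
    walk = [start]
    while len(walk) < length:
        neighbours = adj.get(walk[-1], [])
        if not neighbours:           # dead end: stop the walk early
            break
        walk.append(random.choice(neighbours))
    return walk
```

The resulting sequences can then be fed to a skip-gram style model, as DeepWalk and Node2Vec do.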
Node embedding has the following disadvantages. Firstly, the methods described
above are not suitable for graphs that are frequently updated, as they cannot handle
nodes appearing for the first time. Secondly, they cannot incorporate node and
edge features into prediction problems.
Graph neural network models have been studied for quite a long time. Recently,
they have received more attention from the community and are divided quite clearly
into 2 main subclasses:
Spectral Graph Neural Network: This subclass is related to the concepts of
matrix decomposition, such as eigen-decomposition, eigenvectors, and eigenvalues.
However, spectral-based methods often have quite large computational costs
and are gradually being replaced by spatial-based methods.
Spatial Graph Neural Network: This subclass is a simpler method both in
terms of mathematics and modeling. The spatial-based method is based on
the idea of building node embeddings from neighboring nodes.
2.3.4 Graph Neural Network
The paper [10] introduces the graph convolutional neural network. For node
embeddings, similarity in the embedding space approximates similarity in the
graph. In this regard, the network produces a node-level output Z (an N × F
feature matrix, where F is the number of output features per node). Graph-level
outputs can be modeled by introducing some form of pooling operation. A neural
network layer can then be written as a non-linear function:
H^{(l+1)} = f(H^{(l)}, A)    (2.1)
f(H^{(l)}, A) = σ(A H^{(l)} W^{(l)})    (2.2)
with H^{(0)} = X and H^{(L)} = Z, L being the number of layers, where W^{(l)} is the
weight matrix of the l-th neural network layer and σ is a non-linear activation
function such as ReLU. Technically, the main idea is to generate node embeddings
based on local network neighborhoods: nodes aggregate information from their
neighbors using neural networks. In this regard, the propagation rule can be
updated as follows [10]:
f(H^{(l)}, A) = σ(D̂^{-1/2} Â D̂^{-1/2} H^{(l)} W^{(l)})    (2.3)
where Â = A + I, I represents the identity matrix, and D̂ is the diagonal node
degree matrix of Â. The network neighborhood defines a computation graph:
every node defines a computation graph based on its neighborhood, as shown in
figure 2.9. The traditional Graph Convolutional Network (GCN) model has several
limitations, which are as follows:
Memory Requirement: The model's weights are updated each epoch, but each
epoch uses full-batch gradient descent, not mini-batch gradient descent, meaning
all data points are updated at the same time. This approach is understandable
because, in the update formula above, the model must keep all the weights and
the adjacency matrix A. However, for a larger dataset with millions of nodes and
a dense adjacency matrix, this approach is completely inappropriate because the
memory requirement is very large.
Directed Edges and Edge Features: The GCN model as published does not use
other factors such as edge features (the adjacency matrix A is just a binary
matrix) or directed graphs; the treatment in the paper is limited to undirected
graphs.
Figure 2.9: Node Embedding in GCN
Transductive Setting: When new nodes (with new links) are added to the graph,
the GCN model generalizes very poorly to those new nodes and requires re-training
to update the model.
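The propagation rule of Eq. (2.3) can be written compactly in NumPy. This is a minimal sketch of one layer (with ReLU chosen as the non-linearity σ; the function name and toy inputs are illustrative):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One propagation step of Eq. (2.3):
    sigma(D_hat^{-1/2} A_hat D_hat^{-1/2} H W), with A_hat = A + I
    and ReLU as the non-linearity sigma."""
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    d = A_hat.sum(axis=1)                       # node degrees of A_hat
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))      # D_hat^{-1/2}
    return np.maximum(0.0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)
```

Stacking such layers, with H^{(0)} = X, yields the full GCN forward pass.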
Following [24], GraphSage is an inductive learning method, which means it
generalizes better to unseen data. It is still based on the idea of generating node
embeddings from neighboring nodes. In the GraphSage paper, the authors design
aggregate functions to aggregate information from neighboring nodes and propose
three corresponding aggregate functions. GraphSage uses mini-batch gradient
descent and is a spatial GNN method, overcoming the biggest limitation of GCN,
which is updating with full-batch gradient descent.
h^0_v ← x_v, ∀v ∈ V
for k ∈ {1, 2, ..., K} do
    for v ∈ V do
        h^k_{N(v)} ← AGGREGATE_k({h^{k-1}_u, ∀u ∈ N(v)})
        h^k_v ← σ(W^k · concat(h^{k-1}_v, h^k_{N(v)}))
    end
    h^k_v ← h^k_v / ||h^k_v||_2, ∀v ∈ V
end
z_v ← h^K_v, ∀v ∈ V
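One iteration of the pseudocode above, using the mean aggregator, can be sketched as follows (a minimal NumPy illustration; tanh is chosen as σ, and the function name and weight shapes are assumptions of the sketch):

```python
import numpy as np

def graphsage_step(H, neigh, W, sigma=np.tanh):
    """One GraphSAGE layer with the mean aggregator, following the
    pseudocode above: average the neighbour embeddings, concatenate
    with the node's own embedding, apply W and sigma, then L2-normalize.
    W has shape (F_out, 2 * F_in); neigh maps node -> neighbour indices."""
    out = []
    for v in range(H.shape[0]):
        if neigh[v]:
            h_N = H[neigh[v]].mean(axis=0)       # AGGREGATE_k: mean
        else:
            h_N = np.zeros(H.shape[1])           # isolated node
        h = sigma(W @ np.concatenate([H[v], h_N]))
        out.append(h / (np.linalg.norm(h) + 1e-12))
    return np.stack(out)
```

Applying this step K times, with sampled neighbourhoods per mini-batch, gives the final embeddings z_v.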
The AGGREGATE function must operate on a set of unordered neighbour node
vectors of each node v. Common choices include the mean aggregator, the pooling
aggregator, and the LSTM aggregator (on a random permutation of neighbours).
The final loss function is calculated in an unsupervised setting. A positive
neighbour v is a node that co-occurs within a fixed-length random walk from
node u, while negative neighbours are sampled from a distribution p_n(v). The
final loss function of GraphSage, J(z_u), is similar to the NCE noise-contrastive
loss: similar item pairs receive higher values while unrelated item pairs receive
lower values.
Graph data is not ordered or positional like sequences or images. Therefore, the
defined aggregator functions must also be symmetric, meaning they are insensitive
to permutations of neighboring nodes. The paper mentions the use of three
aggregator functions: mean, pooling, and LSTM.
The mean aggregator is a non-parametric and symmetric function that simply
averages the vectors of neighboring nodes, i.e., performs an element-wise mean
operation. The 2-vector concatenation in the pseudocode above is similar to a
“skip connection” in a residual network.
On the other hand, the LSTM aggregator is a parametric function. LSTM is
designed for sequential problems, which are not symmetric; however, the paper
uses random permutations of the neighboring nodes as input. The results obtained
are also quite positive compared to the other aggregator functions.
The GraphSage loss function is defined as follows:
J(z_u) = -log(σ(z_u^T z_v)) - Q · E_{v_n ∼ p_n(v)}[log(σ(-z_u^T z_{v_n}))]    (2.4)
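Eq. (2.4) is straightforward to compute given a positive pair and a set of sampled negatives. A minimal sketch (function names are illustrative; the expectation is approximated by the mean over the drawn negative samples):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graphsage_loss(z_u, z_v, z_negs, Q=1.0):
    """Unsupervised GraphSage loss of Eq. (2.4): the positive term pulls
    the co-occurring pair (z_u, z_v) together; the Q-weighted expectation
    over negative samples z_negs (drawn from p_n(v)) pushes unrelated
    pairs apart."""
    pos = -np.log(sigmoid(np.dot(z_u, z_v)))
    neg = -Q * np.mean([np.log(sigmoid(-np.dot(z_u, z_n))) for z_n in z_negs])
    return pos + neg
```

As expected, the loss is small when the positive pair is aligned and the negatives are not, and large in the opposite configuration.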
The loss calculation is based on neighboring nodes, depending on the number of
steps in a random walk. This approach helps the model to be more generalizable
across the data, even when applied to unseen nodes. It is completely different from
fixed embedding training for each node, as seen in node embedding models like
DeepWalk or Node2Vec. This distinction represents the most significant differ-
ence between transductive learning models (DeepWalk, Node2Vec) and inductive
learning models like GraphSage.
The paper [25] presents an attention-based graph neural network model for
semi-supervised classification on a graph. The researchers show that the method
consistently outperforms competing methods on the standard benchmark citation
network datasets. Additionally, it demonstrates that the learned attention provides
interesting insight into how neighbors influence each other. During training, they
experimented with more complex attention models.
The graph attentional layer (GAT) is used to model graph propagation. In each
layer, node i attends to the other nodes j, and an attention coefficient is calculated.
The final softmax attention is computed only over the set of neighboring nodes
N_i of each node i:
h = {h_1, h_2, ..., h_N}, h_i ∈ R^F,  W ∈ R^{F'×F}
e_ij = a(W h_i, W h_j), for j ∈ N_i (the neighbourhood of node i)
α_ij = softmax_j(e_ij) = exp(e_ij) / Σ_{k∈N_i} exp(e_ik)
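The attention computation above can be sketched for a single head in NumPy. Here `a` is assumed to be a weight vector applied to the concatenation [W h_i ; W h_j], and the LeakyReLU used in the GAT paper is omitted for brevity:

```python
import numpy as np

def gat_attention(h, W, a, i, neigh_i):
    """Single-head attention coefficients alpha_ij of node i over its
    neighbours N_i: e_ij = a^T [W h_i ; W h_j], normalized with a
    softmax over j in N_i."""
    Wh = h @ W.T                                      # project all nodes
    e = np.array([a @ np.concatenate([Wh[i], Wh[j]]) for j in neigh_i])
    e = np.exp(e - e.max())                           # stable softmax
    return e / e.sum()
```

The coefficients α_ij then weight the neighbour features when aggregating them into the new embedding of node i.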
The papers [26] and [27] propose a new framework for graph neural network
models that can more fully exploit edge features, including those of undirected
or multi-dimensional edges. The proposed framework consolidates current
graph neural network models such as graph convolutional networks (GCN) and
graph attention networks (GAT). The proposed framework and new models have
the following novelties: First, they propose to use doubly stochastic normalization
of graph edge features instead of the commonly used row or symmetric normaliza-
tion approaches used in current graph neural networks. Second, they construct new
formulas for the operations in each individual layer so that they can handle multi-
dimensional edge features. Third, for the proposed new framework, edge features
are adaptive across network layers. The proposed new models obtain better perfor-
mance than the current state-of-the-art methods, i.e., GCNs and GAT, which testify
to the importance of exploiting edge features in graph neural networks.
CHAPTER 3. PROPOSED SOLUTION
3.1 Overview
The research focuses on developing a solution to predict subscribers who engage
in spam activities using the Graph Convolutional Network (GCN) model. The re-
search encompasses the following main steps:
Establishing a comprehensive data processing flow that synthesizes subscriber
behavioral characteristics from telecommunications data, and constructing a real-
time user survey and call-labeling stream to accurately identify regular sub-
scribers and spam subscribers.
Creating an aggregated dataset that fully characterizes spam behavior among
subscribers. This dataset serves as the foundation for modeling telecommuni-
cations data in the form of graph data, facilitating model training.
Designing the GCN model architecture to predict spam subscribers based on
the constructed dataset. The performance of this model will be compared
against baseline models, including Support Vector Machines (SVM), XG-
Boost, and Artificial Neural Networks (ANN).
3.2 Mobile Network Data Aggregation
In this section, we briefly describe the mobile network data flow and how we
collected our dataset.
3.2.1 System Architecture
Figure 3.1 describes the system architecture design. The design includes 4 main
modules:
Encrypted module: Gathers Call Detail Record (CDR) logs from the Gateway
Mobile Switching Center (GMSC) server. The module implements a symmetric-
key algorithm, the Data Encryption Standard (DES), to encrypt the mobile num-
ber in each CDR so that the data log is anonymous.
Big Data Module: CDR logs are stored as raw data in the Big Data module. A
distributed file system (DFS), spread across multiple servers to achieve high
scalability and high performance, is implemented. Extract-Transform-Load
(ETL) batch jobs, implemented with Apache Spark, synthesize features for
each mobile number. After the feature engineering process, the user and
relationship feature data tables are stored in Hive tables.
AI modules: Training and optimizing call spam subscriber prediction models.
Including baseline models such as SVM, XGBoost and ANN, as well as graph
neural network models.
Realtime-processing Module: Receives real-time call signals from the GMSC server.
If a call comes from a subscriber suspected of spam, the module sends flash
messages to survey customers through the flash message gateway.
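The anonymization role of the Encrypted module can be sketched as follows. The thesis module uses symmetric DES encryption; as a stdlib-only stand-in this sketch uses a keyed hash instead, which has the same effect for the pipeline (a stable pseudonymous token per mobile number). The key and function name are hypothetical:

```python
import hmac
import hashlib

# Hypothetical secret; the real module holds a DES key instead.
SECRET_KEY = b"hypothetical-secret-key"

def anonymize_msisdn(msisdn: str) -> str:
    """Map a mobile number to a deterministic anonymous token, so the
    same subscriber always gets the same identifier downstream."""
    digest = hmac.new(SECRET_KEY, msisdn.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]
```

Determinism matters here: feature synthesis and graph construction must be able to join records of the same (anonymized) subscriber across days.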
Figure 3.1: System Architect
3.2.2 Collective spam labelling by end-users
In this section, we describe the technique used to label the data set into two
classes: spam and regular users. End-user investigation is the feasible approach
to labeling a call as spam. Our research tried three techniques: flash messages,
SMS messages, and customer-service calls. Overall, the flash message technique
gave the best result: on average, 3% of investigated users responded to the flash
messages, while the other techniques obtained less than 1%. The reason is that
end-users receive flash messages as soon as calls terminate, making it convenient
for them to answer. Therefore, our research used flash messages to label the
investigated mobile numbers. In the first step, we prepare a blacklist containing
potential spam numbers using strict rule-based approaches. Specifically, the
feature list includes:
total-call-out: number of calls that users make in a day.
total-duration-out: Duration of calls that users make in a day.
count-distinct-msisdn-contact: Number of contacts that users make a call or
send a message to in a month.
num-day-active: number of days when subscribers are active in the telephony
network.
Figure 3.2: Survey flash message
Sequentially, the suspected list is pushed into the Label-collection module, which
sends messages to mobile users in the second step, as shown in Figure 3.2. Each
user is asked the question: “Does this call bother you?” The module relies on this
feedback to decide whether a call is spam or not. In fact, most investigated mobile
numbers received both spam and regular responses. To clarify the label of a mobile
number, we define the following label-selection assumption for the spam detection
problem in this study: mobile users with a ratio of spam feedback to normal
feedback greater than three are considered spammers.
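The label-selection assumption above reduces to a simple rule per mobile number. A minimal sketch (the function name and the "unknown" fallback for numbers with no feedback are illustrative choices, not part of the thesis):

```python
def label_mobile_number(spam_feedback: int, normal_feedback: int) -> str:
    """Apply the label-selection assumption: a number whose ratio of
    spam feedback to normal feedback exceeds three is a spammer."""
    if normal_feedback == 0:
        # No normal responses: any spam response dominates the ratio.
        return "spam" if spam_feedback > 0 else "unknown"
    return "spam" if spam_feedback / normal_feedback > 3 else "regular"
```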
3.2.3 Data flow Diagram
Figure 3.3: Data Flow Diagram
Figure 3.3 shows the call spam data flow diagram, including 5 main flows, as
below:
Encrypt CDR logs: the input is raw CDR logs from the GMSC server; the output
is encrypted CDR logs that are stored in the Big Data module.
Feature Synthesis: the input is Encrypted CDR logs. the output is feature list.
The users features are sequentially described as follows:
Ratio of average successful outgoing calls to incoming calls, the average
number of outgoing calls per period (hour, day, year)
Ratio of calls with short duration to the total number of calls.
Ratio of the number of calls with a short time between calls (the time
interval between two consecutive outgoing calls) to the total number of
calls.
Ratio of successful calls to total outgoing calls.
Relation features used to create the graph, such as the ratio of calls to
non-relational numbers (never called before) to total outgoing calls, and
the rate of contacts called only once out of the total contacts.
Label Synthesis: The input is the survey log. The output is the labeled mobile
number list.
Model optimization: The inputs are the feature list and labeled users. Trained
models are stored in Model storage before being promoted to production models.
Collective spam labelling by end-users: The GMSC server sends real-time call
signals to the Kafka server. The real-time processing module loads call signals
and spam-suspected users from cache memory, checks the survey condition, and
sends survey flash messages to mobile users.
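The survey condition in this real-time flow can be sketched as a pure decision function. In the system the suspected-spam set and rate-limit rules live in Redis and call signals arrive from Kafka; here a plain dict stands in for the cache, and the field names and the one-survey-per-day limit are illustrative assumptions.

```python
# Decide whether a survey flash message should be sent for an incoming call
# signal. `suspects` is the cached blacklist; `surveyed_today` is a per-callee
# counter (Redis in the system, a dict here). Names/limits are assumptions.
def should_send_survey(call, suspects, surveyed_today, daily_limit=1):
    caller, callee = call["caller"], call["callee"]
    if caller not in suspects:                        # only survey suspect calls
        return False
    if surveyed_today.get(callee, 0) >= daily_limit:  # rate-limit the surveyee
        return False
    surveyed_today[callee] = surveyed_today.get(callee, 0) + 1
    return True
```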
3.3 Graph Structure-based Telephone Data Representation
The main idea is to exploit the telephony graph for spam detection. For ex-
ample, regular mobile numbers maintain long-lasting relationships within their
groups, whereas spam numbers do not. Therefore, we aim to construct a telephony
graph in which the nodes are mobile numbers and the edges are the relationships
between these numbers. In this regard, we formalize our telephony network graph
G = (V, E) as follows:
V: the vertex set, consisting of mobile-user nodes (subscriber data)
E: the edge set, consisting of relations between mobile users; edges are
directed, from calling number to called number (relation data)
Node features: user behavior and user profile
Edge features: call features between two numbers
To construct the graph, we implemented the edges using a random walk algorithm
[28], starting from initial nodes that are the spam mobile users located by the
process in section 3.2.2. Afterward, we randomly inserted nodes representing
mobile numbers that received calls with a duration of more than 0 seconds. The
stop condition is that the inserted nodes reach a depth of two. Figure 3.4
visualizes a telephony network as a graph structure.
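The construction above can be sketched as follows. A breadth-first expansion with random neighbor sampling stands in for the random-walk sampling of [28]; the call-log format and the fanout parameter are illustrative assumptions.

```python
import random

# Expand a graph from seed (suspected-spam) numbers along calls with
# duration > 0, stopping at depth two. `fanout` caps sampled neighbors
# per node (an assumption standing in for random-walk sampling).
def build_graph(seeds, calls, max_depth=2, fanout=3, rng=random.Random(0)):
    adj = {}
    for c in calls:
        if c["duration"] > 0:                     # skip unanswered calls
            adj.setdefault(c["caller"], set()).add(c["callee"])
    nodes, edges = set(seeds), set()
    frontier = set(seeds)
    for _ in range(max_depth):
        nxt = set()
        for u in frontier:
            nbrs = list(adj.get(u, ()))
            for v in rng.sample(nbrs, min(fanout, len(nbrs))):
                edges.add((u, v))
                if v not in nodes:
                    nodes.add(v)
                    nxt.add(v)
        frontier = nxt
    return nodes, edges
```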
Figure 3.4: A telephony network graph constructed by the random walk algorithm
3.4 Spam call detection using GNN
In this section, we discuss how we model the problem in terms of node
prediction. Specifically, sub-section 3.4.1 discusses the problem definition of
GNN-based spam call detection. Sub-section 3.4.2 describes the proposed GNN
architecture to solve the problem, which is regarded as a node classification task.
3.4.1 Graph neural network for Spam call detection
Previous works used classical ML models such as SVM, logistic regression,
random forest, and XGBoost, predicting spam subscribers from subscriber features
related to call and SMS message behavior. These models work well for typical
cases such as robot calls. However, the number of spam subscribers with behavior
similar to that of normal subscribers is increasing, which makes the mobile
network more complicated. For instance, sellers use the same phone to advertise
goods and for their daily life. Furthermore, there are legitimate subscribers with
spam-like behavior, such as shippers and customer-care subscribers, who also call
many different customers. In these cases, the classical classification models give
low performance.
Alternatively, the new approach in this study is to model the relationships
between subscribers as a graph structure. In particular, the classification of a
subscriber is based not only on its own features but also on the features of
related subscribers. Consequently, the spam call detection data can be represented
as a graph structure (a directed graph), and a GNN-based model used for the
classification.
Figure 3.5: Node Embedding in GNN with Edge feature
3.4.2 Proposed Model
As shown in figure 3.5, the study presents a graph neural network (GNN) model
for spam user detection, where each subscriber is a node in the graph. The
characteristics of a node are aggregated from neighboring nodes using the GNN
model presented in the data-processing-flow section.
The study uses two levels of aggregation in the model, meaning that the charac-
teristics of a node are aggregated from its level-1 and level-2 neighbors. This is
useful for the spam call prediction problem because predicting spam subscribers
relies not only on the behavior of the subscriber itself but also on the charac-
teristics of the subscribers it contacts. For example, regular subscribers often
communicate with each other. Another case is that regular subscribers who often
receive spam calls are also likely to receive other spam calls.
In addition, the proposed model integrates the characteristics of the edges in the
graph. Each edge, here a call contact, has characteristics such as call length, call
direction, and call frequency. These characteristics are aggregated into the target
node to carry information about the contact patterns between neighboring nodes,
which gives the prediction model more information for decision making.
The following formula represents the calculation of the node embedding with
edge and user features. Here, h_v is the embedding of user v; x_v and x_w are the
features of users v and w, respectively. In particular, e_vw is the characteristic of
the edge between users v and w. The features of users and edges are described in
section 3.2. With the aggregate function γ used similarly to GCN, GAT, and
GraphSAGE, we observe that the embedding of a user includes its own feature
information, information about related users, and the relationship itself.

h_v = γ(x_v, Σ_{w ∈ N(v)} φ(x_v, x_w, e_vw))    (3.1)
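A minimal PyTorch sketch of this edge-aware aggregation follows. Here φ is a linear message function over the concatenated node and edge features and γ a linear update with tanh, which mirrors (but does not exactly reproduce) GCN/GAT/GraphSAGE-style aggregation; the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

# One edge-aware message-passing layer in the spirit of equation (3.1):
# messages phi(x_v, x_w, e_vw) are summed over neighbors and combined with
# the node's own features by gamma. Dimensions are illustrative.
class EdgeAwareLayer(nn.Module):
    def __init__(self, node_dim, edge_dim, out_dim):
        super().__init__()
        self.phi = nn.Linear(2 * node_dim + edge_dim, out_dim)   # message fn
        self.gamma = nn.Linear(node_dim + out_dim, out_dim)      # update fn

    def forward(self, x, edge_index, edge_attr):
        src, dst = edge_index                    # directed edges w -> v
        msg = torch.tanh(self.phi(torch.cat([x[dst], x[src], edge_attr], dim=-1)))
        agg = torch.zeros(x.size(0), msg.size(-1))
        agg.index_add_(0, dst, msg)              # sum over w in N(v)
        return torch.tanh(self.gamma(torch.cat([x, agg], dim=-1)))
```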
The proposed GNN model structure is shown in Figure 3.6. Accordingly, the
model contains two GNN layers and one dense layer for classification.
Figure 3.6: The proposed GNN Architecture
The GNN model takes two inputs, the node feature matrix and the adjacency
matrix, and outputs the spam-labeled graph. Because the number of spam-labeled
nodes is significantly smaller than that of regular nodes, techniques such as
oversampling are applied to remove the imbalance of the data set. Besides, the F1
score, precision, and recall are estimated to benchmark the GNN classification
model.
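The two-GNN-layers-plus-dense architecture can be sketched as follows. A simple normalized-adjacency propagation stands in for the exact layer used in the thesis; the hidden size and dropout follow table 4.5, and everything else is an assumption.

```python
import torch
import torch.nn as nn

# Sketch of the described architecture: two GNN layers, then a dense layer
# producing per-node spam logits. adj_norm is a (pre-)normalized adjacency
# matrix; hidden_dim=120 and drop_out=0.36 follow table 4.5.
class SpamGNN(nn.Module):
    def __init__(self, in_dim, hidden_dim=120, n_classes=2, drop_out=0.36):
        super().__init__()
        self.gnn1 = nn.Linear(in_dim, hidden_dim)
        self.gnn2 = nn.Linear(hidden_dim, hidden_dim)
        self.dense = nn.Linear(hidden_dim, n_classes)
        self.drop = nn.Dropout(drop_out)

    def forward(self, x, adj_norm):
        # Each GNN layer: propagate over the adjacency, then transform.
        h = torch.tanh(self.gnn1(adj_norm @ x))
        h = self.drop(h)
        h = torch.tanh(self.gnn2(adj_norm @ h))
        return self.dense(h)        # logits; softmax is applied in the loss
```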
CHAPTER 4. EXPERIMENTS AND RESULTS
4.1 Big data Framework Setting
As shown in Figure 4.1, the following frameworks are implemented to handle the
big data and high-performance requirements:
Apache Kafka: Message queue that receives real-time call signals.
Redis: Memory cache containing the rules and blacklist for sending survey flash
messages.
Apache Hadoop: Distributed storage containing CDR logs and feature data for the
model.
Apache Spark: Parallel processing framework that handles feature engineering
tasks.
MLlib, Sklearn, PyTorch: AI and ML frameworks for training and deploying
classification models.
Figure 4.1: System Architecture Implementation
The laboratory server configuration is shown in table 4.1, including the Namenode,
Datanode, Kafka, Database, and Business servers.
To install the Big data cluster, the Hortonworks framework HDP-v.3.1.4.0-315 is
used along with the Apache Ambari framework to manage resources and Big data
Table 4.1: Server resources (lab)
No | Module name | Quantity | CPU (cores) | RAM (GB) | HDD (GB)
1 | Name Node | 3 | 8 | 16 | 300
2 | Data Node | 3 | 16 | 32 | 500
3 | Kafka | 3 | 8 | 8 | 200
4 | Database | 2 | 8 | 8 | 200
5 | Business module | 2 | 8 | 8 | 200
components. The Apache Hadoop, Spark, and Kafka components are managed
through Ambari's interface, as shown in figure 4.2.
Figure 4.2: Big data cluster on Apache Ambari dashboard
4.2 Feature Engineering
After the raw data is collected, the feature aggregation flows are performed by
batch jobs in Spark. The jobs are optimized through parameters such as the
number of executors, RAM, and cores. The table below shows the optimization of
a batch job on a 22-million-row data set.
Batch jobs in the optimization process are shown in figures 4.3, 4.4, 4.5, and 4.6.
The execution time decreases from 556 s to 288 s when we increase the number of
executors, executor cores, and executor core memory to 16, 8, and 4 (MB),
respectively, for one batch job. With this configuration, the processing time of the
feature engineering tasks is optimized.
Table 4.2: Batch parameter optimization
Number of executors | Number of executor cores | Executor core memory (MB) | Execution time (s) | Aggregated resource allocation (million MB-seconds)
16 | 8 | 4 | 288 | 38
8 | 8 | 4 | 312 | 20
4 | 8 | 4 | 395 | 13
2 | 8 | 4 | 556 | 9
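The best-performing row of the table corresponds to a spark-submit invocation along these lines. The job name and master are placeholders, and the memory flag is given as 4g here as an assumption, since Spark executors require gigabyte-scale memory while the table reports the value in MB.

```shell
# Illustrative resource flags matching the fastest configuration in Table 4.2
# (16 executors, 8 cores each). Paths, master, and the 4g memory value are
# assumptions, not the exact production command.
spark-submit \
  --master yarn \
  --num-executors 16 \
  --executor-cores 8 \
  --executor-memory 4g \
  feature_batch_job.py
```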
4.3 Dataset
The custom dataset includes 5,119 nodes (3,069 spam nodes and 2,050 non-spam
nodes) and 1,709 edges. Each node contains 142 features and each edge contains
9 features. The training set contains 3,563 nodes (2,156 spam nodes and 1,407
non-spam nodes). The test set contains 1,556 nodes (913 spam nodes and 643
non-spam nodes). The feature list is shown in table 4.3; the Appendix presents the
list in more detail. Feature weights are calculated to specify the feature importance
in the model, as shown in figure 4.7.
The most important features are num_1way_callout, num_contacts_telecom_out,
num_calls_out, count_distinct_msisdn_out, and total_call_out, among others.
4.4 Model Environment settings
4.4.1 Machine learning framework Installation
a, Scikit-learn
Scikit-learn (sklearn) is a powerful library of machine learning algorithms
written in Python that provides a set of tools for machine learning and statistical
modeling problems, including classification, regression, clustering, and dimen-
sionality reduction. The library is licensed under the FreeBSD standard and can
run on many Linux platforms. Scikit-learn is used as a learning resource.
To install scikit-learn, one must first install the SciPy (Scientific Python) library,
whose components include NumPy, SciPy, Matplotlib, IPython, SymPy, and
Pandas. SciPy extension libraries are often named SciKits; the package of classes
and functions used in machine learning algorithms is named scikit-learn.
Scikit-learn provides strong support for building products, focusing on being easy
to use, easy to code, easy to reference, easy to work with, and highly effective.
Although written for Python, scikit-learn's core libraries are actually written
Table 4.3: Feature List
Group | Description | Feature example
Call-out features | Statistical features related to call-out behavior in 1 day, 15 days, and 30 days | total-call-out, count-distinct-msisdn-call-out, percentile-total-call-out-50
1-way-call-out features | Statistical features for cases in which callers make calls to a listener with no reverse direction in 30 days | total-1-way-call-out, count-distinct-msisdn-1way-call-out, total-1-way-duration-out
Duration-call-out-per-call features | Statistical features related to the duration of calls that a mobile number makes in 30 days | total-duration-out, percentile-duration-call-out-25, percentile-duration-call-out-50, percentile-duration-call-out-75, max-duration-call-out
Call-in features | Statistical features related to calls that mobile numbers receive in 30 days | total-call-in, count-distinct-msisdn-call-in, total-duration-in, percentile-total-call-in-50
1-way-call-in features | Statistical features for cases in which listeners receive a call with no reverse direction in 30 days | total-1-way-call-in, count-distinct-msisdn-1way-call-in, total-1-way-duration-in, percentile-total-call-in-1way-50
Duration-call-in-per-call features | Statistical features related to the duration of calls that a mobile number receives in 30 days | percentile-duration-call-in-50, max-duration-call-in, min-duration-call-in, avg-duration-call-in, std-duration-call-in
Sms-out features | Statistical features related to SMS messages that subscribers send in 30 days | total-sms-out
1-way-sms-out features | Statistical features related to SMS messages that subscribers send with no reverse direction in 30 days | total-1-way-sms-out, count-distinct-msisdn-1way-sms-out
Sms-in features | Statistical features related to SMS messages that subscribers receive in 30 days | total-sms-in, count-distinct-msisdn-sms-in
1-way-sms-in features | Statistical features related to SMS messages that subscribers receive with no reverse direction in 30 days | total-1-way-sms-in, count-distinct-msisdn-1way-sms-in
Relation features | Statistical features related to the relations that a mobile number has with others | count-distinct-msisdn-contact, count-distinct-msisdn-1way-contact, ratio-1way-msisdn-contact
Active behavior features | Statistical features related to the active behavior of users on the mobile network | num-day-active, set-day-call-out, set-day-call-in, set-day-sms-out, set-day-sms-in, set-day-active
on top of C libraries to increase performance. Examples are NumPy (matrix
calculation), LAPACK, LibSVM, and Cython.
b, PyTorch
PyTorch is an open-source machine learning library for Python that was devel-
oped by Facebook and first publicly released in 2016. It is used for applications
such as natural language processing. PyTorch redesigns and implements Torch in
Python while sharing the same core C libraries for the back-end code. PyTorch
developers adapted this back-end code to run Python efficiently, and kept the
GPU-based hardware acceleration and extensibility features of the Lua-based
Torch.
PyTorch is known for being more widely used in research than in production.
However, since its release the year after TensorFlow, PyTorch has seen a sharp
increase in usage among professional developers. Because Python programmers
found it so natural to use, PyTorch quickly gained traction, inspiring the Tensor-
Flow team to adopt many of PyTorch's most popular features in TensorFlow 2.0.
PyTorch provides an easy-to-use API, making it simple to operate and run in
Python. The library is considered Pythonic and integrates seamlessly with the
Python data science stack, so it can take advantage of all the services and
functions provided by the Python environment. PyTorch also provides an excellent
computational graph feature that allows users to define and optimize dynamic
graphs of computations.
4.4.2 Baseline Model Hyperparameters
We select random forest, SVM, XGBoost, and ANN as the baseline models for
the comparison, following the work in [9] on preventing malicious calls over
telephony networks. Specifically, we applied the sklearn library to build the
random forest, SVM, XGBoost, and ANN models. The hyperparameter
configurations are shown in Table 4.4.
4.4.3 GNN Hyperparameter Setting
We used the PyTorch library to implement the GNN model following [29]. The
model includes two GNN layers and one dense layer. The number of epochs is set
to 300. The hyperparameters of our model are shown in table 4.5.
4.5 Evaluation standard
Accuracy is the simplest and most commonly used method for evaluating ma-
chine learning models. This evaluation calculates the ratio between the number of
correctly predicted points and the total number of points in the test dataset.
Table 4.4: Baseline Model Hyperparameters
Model | Hyperparameters
Random forest | test_size_equal: 0.3, random_state_equal: 42, max_depth: 10.25, min_samples_leaf: 9, min_samples_split: 6, random_state: 42
SVM | C: 0.985, degree: 3, gamma: 0.549, kernel: linear, random_state: 42
XGBoost | colsample_bylevel: 0.45, colsample_bytree: 0.45, learning_rate: 0.275, max_depth: 5, random_state: 42, sub_sample: 0.6, min_child_weight: 0.12, n_estimators: 320
ANN | drop_out: 0.36, ann_activation: tanh, Lr: 0.02, momentum: 0.86, optim: adam
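Two of the baselines can be instantiated in sklearn roughly as follows. The training data is omitted, XGBoost and the ANN would need additional libraries, and max_depth is rounded from the table's 10.25 to 10 since sklearn expects an integer; all remaining defaults are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Baselines with the Table 4.4 hyperparameters (max_depth rounded to an
# integer; gamma is listed in the table but is ignored by a linear kernel).
rf = RandomForestClassifier(max_depth=10, min_samples_leaf=9,
                            min_samples_split=6, random_state=42)
svm = SVC(C=0.985, kernel="linear", degree=3, gamma=0.549, random_state=42)
```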
Table 4.5: The hyperparameters of the proposed GCN model.
Hyper parameter Value
drop_out 0.36
gnn_activation tanh
gnn_hidden_dim 120
Lr 0.02
momentum 0.86
optim adam
gcn_layer_number 2
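A hedged sketch of the training loop implied by these settings follows: cross-entropy loss and Adam with lr = 0.02. The model here is a small stand-in MLP so the sketch is self-contained; in the thesis the two-layer GNN is trained this way for 300 epochs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(32, 10)                  # stand-in node features
y = (x[:, 0] > 0).long()                 # stand-in spam labels
model = nn.Sequential(nn.Linear(10, 16), nn.Tanh(), nn.Linear(16, 2))
opt = torch.optim.Adam(model.parameters(), lr=0.02)   # lr per table 4.5
loss_fn = nn.CrossEntropyLoss()

for epoch in range(60):                  # the thesis trains for 300 epochs
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
```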
However, accuracy alone does not show how each class is classified: which class
is classified correctly most often, and whether data belonging to one class is often
misclassified into another class. To evaluate these aspects, we use a matrix called
a confusion matrix.
The confusion matrix represents how many data points actually belong to a class
and are predicted to fall into each class. It is a square matrix with each dimension
equal to the number of classes; the value in row i and column j is the number of
points that belong to class i but are predicted to belong to class j.
For classification problems where the class distributions are very different,
precision and recall are effective measures that are often used. Consider the
binary classification problem, with one of the two classes regarded as positive and
the other as negative. Precision is defined as the ratio of true positive points
among those classified as positive (TP + FP). Recall is defined as the ratio of true
positive points among those that are actually positive (TP + FN). Mathematically,
precision and recall are two fractions with equal numerators but different
denominators.
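The definitions above, together with the F1 score used later, can be written directly from the confusion-matrix counts (TP, FP, FN as in figure 4.9):

```python
# Precision, recall, and F1 from binary confusion-matrix counts.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)                 # of predicted positives, correct
    recall = tp / (tp + fn)                    # of actual positives, found
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, f1
```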
Table 4.6: Model results
Model | Accuracy | F1 score | Precision | Recall
Random Forest | 0.9678 | 0.961 | 0.972 | 0.948
SVM | 0.971 | 0.964 | 0.97 | 0.958
XGBoost | 0.968 | 0.961 | 0.973 | 0.95
ANN | 0.969 | 0.962 | 0.974 | 0.95
GNN | 0.978 | 0.973 | 0.972 | 0.975
Besides, the F1 score is a machine learning evaluation metric that measures a
model's accuracy by combining its precision and recall scores. The accuracy
metric computes how often a model makes a correct prediction across the entire
dataset, which is reliable only if the dataset is class-balanced, that is, each class
has the same number of samples. The F1 score combines precision and recall
using their harmonic mean, and maximizing the F1 score implies simultaneously
maximizing both precision and recall. Thus, the F1 score has become a common
choice for evaluating models in conjunction with accuracy.
4.6 Main results
The model converges after 60 epochs; the training and validation losses are shown
in figure 4.8. The learning curves show a good fit: the training loss decreases to a
point of stability, and the validation loss decreases to a point of stability with a
small gap from the training loss. The loss of the model is always lower on the
training dataset than on the validation dataset.
Table 4.6 shows the model comparison results. The metrics are accuracy, F1 score,
precision, and recall. The GNN model gives the highest accuracy, F1 score, and
recall. In particular, the GNN F1 score reaches 0.973, while the highest baseline
F1 score is 0.964, achieved by SVM. The GNN accuracy and recall are 0.978 and
0.975, respectively. The GNN precision of 0.972 is slightly lower than that of the
other models; however, it is acceptable for filtering positive labels. The confusion
matrices of the baseline models and the proposed model are shown in figure 4.11
and figure 4.10, respectively.
Figure 4.3: The first optimization Result
Figure 4.4: The second optimization result
Figure 4.5: The third optimization result
Figure 4.6: The fourth optimization result
Figure 4.7: Feature Importance
Figure 4.8: Train loss and val loss
Figure 4.9: Precision and recall calculation [30]
Figure 4.10: GNN Confusion matrix
Figure 4.11: Base Model Confusion Matrix
CHAPTER 5. CONCLUSIONS
5.1 Summary
The research introduces a novel approach to spam call detection with three main
contributions. First, implementing a big data processing architecture that uses
telecommunications data to synthesize data features and determine spam or
normal calls. Second, constructing a telephony graph data set for the spam call
problem that exploits the relationships among users in more detail. Finally,
proposing a graph neural network (GNN)-based model for the spam detection
task. Extensive experiments show that our framework achieves an F1 score of
0.973, better than the strongest baseline model. The proposed solution has been
officially deployed in the Viettel network's filtering practices.
Besides, the paper named “A novel method for spam call detection using graph
convolutional networks” was accepted for presentation at the 16th ACIIDS
(Asian Conference on Intelligent Information and Database Systems) conference
in 2023.
5.2 Suggestions for Future Work
The research will be furthered in two primary directions. First, applying the
graph neural network (GNN) model to larger data sets with over 50 million nodes.
Second, applying an edge prediction model between nodes to predict calls and
relationships between subscribers.
REFERENCE
[1] J. D. Rosenberg and C. Jennings, “The session initiation protocol (SIP) and
spam, RFC, vol. 5039, pp. 1–28, 2008. DOI: 10.17487/RFC5039. [On-
line]. Available: https://doi.org/10.17487/RFC5039.
[2] J. Peterson and C. Jennings, Enhancements for Authenticated Identity Man-
agement in the Session Initiation Protocol (SIP), RFC 4474, Aug. 2006.
DOI: 10.17487/RFC4474. [Online]. Available: https://www.rfc-
editor.org/info/rfc4474.
[3] R. MacIntosh and D. Vinokurov, “Detection and mitigation of spam in ip
telephony networks using signaling protocol analysis, in IEEE/Sarnoff Sym-
posium on Advances in Wired and Wireless Communication, 2005., 2005,
pp. 49–52. DOI: 10.1109/SARNOF.2005.1426509.
[4] D. Lentzen, G. Grutzek, H. Knospe, and C. Porschmann, “Content-based de-
tection and prevention of spam over ip telephony - system design, prototype
and first results, in 2011 IEEE International Conference on Communica-
tions (ICC), 2011, pp. 1–5. DOI: 10.1109/icc.2011.5963108.
[5] B. Elizalde and D. Emmanouilidou, “Detection of robocall and spam calls
using acoustic features of incoming voicemails, in Proc. Mtgs. Acoust, ASA,
POMA, 2021, p. 060004. [Online]. Available: https://www.microsoft.com/en-us/research/publication/detection-of-robocall-and-spam-calls-using-acoustic-features-of-incoming-voicemails/.
[6] N. Chaisamran, T. Okuda, G. Blanc, and S. Yamaguchi, “Trust-based voip
spam detection based on call duration and human relationships, in 2011
IEEE/IPSJ International Symposium on Applications and the Internet, 2011,
pp. 451–456. DOI: 10.1109/SAINT.2011.84.
[7] Y. Kontsewaya, E. Antonov, and A. Artamonov, “Evaluating the effective-
ness of machine learning methods for spam detection, Procedia Comput.
Sci., vol. 190, no. C, 479–486, 2021, ISSN: 1877-0509. DOI: 10.1016/j.
procs.2021.06.056. [Online]. Available: https://doi.org/10 .
1016/j.procs.2021.06.056.
[8] S. M. Gowri, G Sharang Ramana, M Sree Ranjani, and T Tharani, “De-
tection of telephony spam and scams using recurrent neural network (rnn)
algorithm, in 2021 7th International Conference on Advanced Computing
and Communication Systems (ICACCS), vol. 1, 2021, pp. 1284–1288. DOI:
10.1109/ICACCS51430.2021.9441982.
[9] H. Li, X. Xu, C. Liu, et al., “A machine learning approach to prevent mali-
cious calls over telephony networks, in 2018 IEEE Symposium on Security
and Privacy (SP), 2018, pp. 53–69. DOI: 10.1109/SP.2018.00034.
[10] T. N. Kipf and M. Welling, “Semi-supervised classification with graph con-
volutional networks, CoRR, vol. abs/1609.02907, 2016. arXiv: 1609.02907.
[Online]. Available: http://arxiv.org/abs/1609.02907.
[11] E. Schooler, J. Rosenberg, H. Schulzrinne, et al., SIP: Session Initiation Pro-
tocol, RFC 3261, Jul. 2002. DOI: 10.17487/RFC3261. [Online]. Avail-
able: https://www.rfc-editor.org/info/rfc3261.
[12] J. Heo, T. Kusumoto, E. Y. Chen, and M. Itoh, “A statistical analysis method
for detecting mass call spam in sip-based voip service, in 8th Asia-Pacific
Symposium on Information and Telecommunication Technologies, 2010, pp. 1–
6.
[13] Y. Bai, X. Su, and B. Bhargava, “Detection and filtering spam over internet
telephony: a user-behavior-aware intermediate-network-based approach,
in 2009 IEEE International Conference on Multimedia and Expo, 2009, pp. 726–
729. DOI: 10.1109/ICME.2009.5202597.
[14] A. Kwong, J. H. Muzamal, and Z. Khan, “Privacy pro: Spam calls detection
using voice signature analysis and behavior-based filtering, in 2022 17th
International Conference on Emerging Technologies (ICET), 2022, pp. 184–
189. DOI: 10.1109/ICET56601.2022.10004692.
[15] H. Huang, H.-T. Yu, and X.-L. Feng, “A spit detection method using voice
activity analysis, in 2009 International Conference on Multimedia Informa-
tion Networking and Security, vol. 2, 2009, pp. 370–373. DOI: 10.1109/
MINES.2009.253.
[16] R. J. Ben Chikha, T. Abbes, and A. Bouhoula, “A spit detection algorithm
based on user’s call behavior, in 2013 21st International Conference on
Software, Telecommunications and Computer Networks - (SoftCOM 2013),
2013, pp. 1–5. DOI: 10.1109/SoftCOM.2013.6671851.
[17] H. Zhang and R. Dantu, “Opt-in detection based on call detail records,
in 2009 6th IEEE Consumer Communications and Networking Conference,
2009, pp. 1–2. DOI: 10.1109/CCNC.2009.4784920.
[18] P. Ravula, S. Kumar Ch, S. Gopisetty, H. Pedhamallu, V. K. Mishra, and T.
Badal, “Voip spam detection using machine learning, in 2022 6th Interna-
tional Conference on Intelligent Computing and Control Systems (ICICCS),
2022, pp. 1251–1258. DOI: 10.1109/ICICCS53718.2022.9788233.
[19] S. Malhotra, G. Arora, and R. Bathla, “Detection and analysis of fraud phone
calls using artificial intelligence, in 2023 International Conference on Re-
cent Advances in Electrical, Electronics Digital Healthcare Technologies
(REEDCON), 2023, pp. 592–595. DOI: 10.1109/REEDCON57544.2023.
10150631.
[20] Vladimir Kaplarevic, Apache Hadoop architecture explained. [Online]. Available: https://phoenixnap.com/kb/apache-hadoop-architecture-explained (visited on 09/30/2023).
[21] Cazton, Apache Spark big data development. [Online]. Available: https://cazton.com/consulting/big-data-development/apache-spark (visited on 09/30/2023).
[22] Raghav Tiwari, How to deploy a Kafka-Zookeeper cluster on Linux-based operating systems. [Online]. Available: https://www.linkedin.com/pulse/how-deploy-kafka-zookeeper-cluster-linux-based-operating-tiwari/ (visited on 09/30/2023).
[23] Stanford University, CS224W: Machine Learning with Graphs. [Online]. Available: http://cs224w.stanford.edu (visited on 09/30/2023).
[24] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning
on large graphs, in Advances in Neural Information Processing Systems, I.
Guyon, U. V. Luxburg, S. Bengio, et al., Eds., vol. 30, Curran Associates,
Inc., 2017. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2017/file/5dd9db5e033da9c6fb5ba83c7a7ebea9-Paper.pdf.
[25] K. K. Thekumparampil, C. Wang, S. Oh, and L.-J. Li, Attention-based graph
neural network for semi-supervised learning, 2018. arXiv: 1803.03735
[stat.ML].
[26] Y. Yang and D. Li, “Nenn: Incorporate node and edge features in graph neu-
ral networks, in Proceedings of The 12th Asian Conference on Machine
Learning, S. J. Pan and M. Sugiyama, Eds., ser. Proceedings of Machine
Learning Research, vol. 129, PMLR, 2020, pp. 593–608. [Online]. Available:
https://proceedings.mlr.press/v129/yang20a.html.
[27] L. Gong and Q. Cheng, “Exploiting edge features for graph neural networks,
in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), 2019, pp. 9203–9211. DOI: 10.1109/CVPR.2019.00943.
[28] R. Li, J. X. Yu, L. Qin, R. Mao, and T. Jin, “On random walk based graph
sampling, in 31st IEEE International Conference on Data Engineering,
ICDE 2015, Seoul, South Korea, April 13-17, 2015, J. Gehrke, W. Lehner, K.
Shim, S. K. Cha, and G. M. Lohman, Eds., IEEE Computer Society, 2015,
pp. 927–938. DOI: 10.1109/ICDE.2015.7113345. [Online]. Avail-
able: https://doi.org/10.1109/ICDE.2015.7113345.
[29] DGL Team, NN modules (PyTorch), DGL 0.8.x documentation. [Online]. Available: https://docs.dgl.ai/en/0.8.x/api/python/nn-pytorch.html (visited on 09/30/2023).
[30] M. Sokolova and G. Lapalme, “A systematic analysis of performance mea-
sures for classification tasks, Information Processing Management, vol. 45,
pp. 427–437, Jul. 2009. DOI: 10.1016/j.ipm.2009.03.002.
APPENDIX
Table 5.1: Features of the CDR log
Group | Description | Feature example
Call-out features | Statistical features related to call-out behavior in 30 days | total-call-out, count-distinct-msisdn-call-out, percentile-total-call-out-25, percentile-total-call-out-50, percentile-total-call-out-75, max-total-call-out, min-total-call-out, avg-total-call-out, std-total-call-out
1-way-call-out features | Statistical features for cases in which callers make calls to a listener with no reverse direction in 30 days | total-1-way-call-out, count-distinct-msisdn-1way-call-out, total-1-way-duration-out, percentile-total-call-out-1way-25, percentile-total-call-out-1way-50, percentile-total-call-out-1way-75, max-total-call-out-1way, min-total-call-out-1way, avg-total-call-out-1way, std-total-call-out-1way, percentile-duration-call-out-1way-25, percentile-duration-call-out-1way-50, percentile-duration-call-out-1way-75, max-duration-call-out-1way, min-duration-call-out-1way, avg-duration-call-out-1way, std-duration-call-out-1way, ratio-1way-call-out, ratio-1way-msisdn-call-out, ratio-1way-duration-out
Duration-call-out-per-call features | Statistical features related to the duration of calls that a mobile number makes in 30 days | total-duration-out, percentile-duration-call-out-25, percentile-duration-call-out-50, percentile-duration-call-out-75, max-duration-call-out, min-duration-call-out, avg-duration-call-out, std-duration-call-out, percentile-duration-call-out-25-per-call, percentile-duration-call-out-50-per-call, percentile-duration-call-out-75-per-call, max-duration-call-out-per-call, min-duration-call-out-per-call, avg-duration-call-out-per-call, std-duration-call-out-per-call
Call-in features | Statistical features related to calls that mobile numbers receive in 30 days | total-call-in, count-distinct-msisdn-call-in, total-duration-in, percentile-total-call-in-25, percentile-total-call-in-50, percentile-total-call-in-75, max-total-call-in, min-total-call-in, avg-total-call-in, std-total-call-in
1-way-call-in features | Statistical features for cases in which listeners receive a call with no reverse direction in 30 days | total-1-way-call-in, count-distinct-msisdn-1way-call-in, total-1-way-duration-in, percentile-total-call-in-1way-25, percentile-total-call-in-1way-50, percentile-total-call-in-1way-75, max-total-call-in-1way, min-total-call-in-1way, avg-total-call-in-1way, std-total-call-in-1way, percentile-duration-call-in-1way-25, percentile-duration-call-in-1way-50, percentile-duration-call-in-1way-75, max-duration-call-in-1way, min-duration-call-in-1way, avg-duration-call-in-1way, std-duration-call-in-1way, ratio-1way-call-in, ratio-1way-msisdn-call-in, ratio-1way-duration-in
Duration-call-in-per-call features | Statistical features related to the duration of calls that a mobile number receives in 30 days | percentile-duration-call-in-25, percentile-duration-call-in-50, percentile-duration-call-in-75, max-duration-call-in, min-duration-call-in, avg-duration-call-in, std-duration-call-in, percentile-duration-call-out-25-per-call, percentile-duration-call-out-50-per-call, percentile-duration-call-out-75-per-call, max-duration-call-out-per-call, min-duration-call-out-per-call, avg-duration-call-out-per-call, std-duration-call-out-per-call
Sms-out features | Statistical features related to SMS messages that subscribers send in 30 days | total-sms-out, count-distinct-msisdn-sms-out, percentile-total-sms-out-25, percentile-total-sms-out-50, percentile-total-sms-out-75, max-total-sms-out, min-total-sms-out, avg-total-sms-out, std-total-sms-out, count-distinct-msisdn-out, count-distinct-msisdn-1way-out
1-way-sms-out features | Statistical features related to SMS messages that subscribers send with no reverse direction in 30 days | total-1-way-sms-out, count-distinct-msisdn-1way-sms-out, percentile-total-sms-out-1way-25, percentile-total-sms-out-1way-50, percentile-total-sms-out-1way-75, max-total-sms-out-1way, min-total-sms-out-1way, avg-total-sms-out-1way, std-total-sms-out-1way, ratio-1way-msisdn-sms-out, ratio-1way-sms-out, ratio-1way-msisdn-out
Sms-in features | Statistical features related to SMS messages that subscribers receive in 30 days | total-sms-in, count-distinct-msisdn-sms-in, percentile-total-sms-in-25, percentile-total-sms-in-50, percentile-total-sms-in-75, max-total-sms-in, min-total-sms-in, avg-total-sms-in, std-total-sms-in, count-distinct-msisdn-in
1-way-sms-in features | Statistical features related to SMS messages that subscribers receive with no reverse direction in 30 days | total-1-way-sms-in, count-distinct-msisdn-1way-sms-in, percentile-total-sms-in-1way-25, percentile-total-sms-in-1way-50, percentile-total-sms-in-1way-75, max-total-sms-in-1way, min-total-sms-in-1way, avg-total-sms-in-1way, std-total-sms-in-1way, ratio-1way-msisdn-sms-in, ratio-1way-sms-in, count-distinct-msisdn-1way-in, ratio-1way-msisdn-in
Relation features | Statistical features related to the relations that a mobile number has with others | count-distinct-msisdn-contact, count-distinct-msisdn-1way-contact, ratio-1way-msisdn-contact
Active behavior features | Statistical features related to the active behavior of users on the mobile network | num-day-active, set-day-call-out, set-day-call-in, set-day-sms-out, set-day-sms-in, set-day-active